Deterministic finite automata (DFAs) are constructed for various purposes
in computational biology. Little attention, however, has been given to the
efficient construction of minimal DFAs. In this article, we define simple
non-deterministic finite automata (NFAs) and prove that the standard subset
construction transforms NFAs of this type into minimal DFAs. Furthermore,
we show how simple NFAs can be constructed from two types of patterns
popular in bioinformatics, namely (sets of) generalized strings and
(generalized) strings with a Hamming neighborhood.

1 Introduction 

Deterministic and non-deterministic finite automata belong to the curriculum of every 
theoretical computer scientist. It is well known that, given a non-deterministic finite 
automaton (NFA), we can construct a deterministic finite automaton (DFA) recognizing 
the same language by employing the classical subset construction; each state in the 
resulting DFA corresponds to a set of NFA states. The details can be found in many 
textbooks on the topic, for example in [21 El [18]. If Q is an NFA's finite state space, then 
there are $2^{|Q|}$ subsets and hence the same number of DFA states. In most cases, many
of these states turn out to be inaccessible from the start state and can be discarded. In 
practice, we can use a construction scheme that only generates the accessible states by 
performing a breadth-first search on the state space. For each DFA, there exists a
unique (up to isomorphism) minimal DFA that accepts the same language. Following
the subset construction, we may thus want to minimize the resulting DFA, for example 
by using Hopcroft's algorithm [21 [S]- 

In computational biology, the processing of sequences plays a prominent role.
Sequences of nucleotides (DNA or RNA) and amino acids (proteins) are key players in the
biology of cells. Recurring elements in such sequences, called patterns or motifs, can
often be associated with biological function [H [16] . Three important problem fields in 
connection with motifs are those of motif search p[T], motif statistics fTEj dU |T2l El E] 
and motif discovery [201 El 113 EH] • Not surprisingly, in many algorithms in these fields, 
motifs are transformed into deterministic automata recognizing all possible instances of 
the motif. Motivated by this observation, we explore the construction of minimal DFAs 
for two common motif classes, namely (sets of) generalized strings and consensus strings 
with a Hamming neighborhood. Ultimately, the goal is to find algorithms whose runtime 
depends linearly on the number of states of the minimal DFA (which would be optimal). 
Although automata theory has been subject to extensive research for decades, not much 
attention has been given to this particular topic. In 2008, van Glabbeek and
Ploeger [21] addressed the problem of determinization and integrated minimization. In
Section 3.1, we discuss the connections between their work and this article.

Our contributions. We identify a class of NFAs that directly result in minimal DFAs
when subjected to the classical subset construction. Although the concept is quite simple 
and seemingly restrictive, we show that it is strong enough to cover many patterns found 
in computational biology. To this end, we give construction schemes to transform (sets 
of) generalized strings and consensus strings with a Hamming neighborhood into NFAs 
which exhibit this property. 

The article is organized as follows. First, we establish notation by briefly restating
textbook definitions of automata in Section 2. Then, in Section 3, we introduce the
concept of simple NFAs and show that applying the subset construction to a simple NFA
directly yields a minimal DFA. The theory is put to work in Sections 4 and 5, where
we discuss the construction of minimal DFAs from generalized strings and consensus
strings, respectively.

2 Notation and Basic Definitions 

Let $\Sigma$ be a finite alphabet and let $\Sigma^k$ be the set of all strings of length $k$. Then, the
set of all finite strings $\bigcup_{i \ge 0} \Sigma^i$ is denoted $\Sigma^*$ and $\bigcup_{i \ge 1} \Sigma^i$ is denoted $\Sigma^+$. For a string
$s \in \Sigma^*$, its length is written $|s|$, and $s_1 s_2$ denotes the concatenation of $s_1$ and $s_2$. The
only string $\varepsilon \in \Sigma^*$ such that $|\varepsilon| = 0$ is called the empty string. By $s[i]$, we refer to the $i$-th
character of $s$, i.e. $s = s[1]s[2] \ldots s[|s|]$. Furthermore, $s[i,j] := s[i]s[i+1] \ldots s[j]$ refers
to a substring of $s$. If $i > j$, we define $s[i,j] := \varepsilon$. Prefixes and suffixes of $s$ are written
$s[..i] := s[1,i]$ and $s[i..] := s[i,|s|]$, respectively.

We can extend the notion of a string in a natural way by allowing a generalized string 
to be a sequence of sets of characters: 

Definition 1 (Generalized string). Given an alphabet $\Sigma$, we call the set $\mathcal{G}_\Sigma := 2^\Sigma \setminus \{\emptyset\}$
the generalized alphabet over $\Sigma$ and a string over $\mathcal{G}_\Sigma$ a generalized string. By $\mathcal{G}_\Sigma^k$ and $\mathcal{G}_\Sigma^*$, we
refer to the set of all generalized strings of length $k$ and the set of all generalized strings
of finite length, respectively. We say a string $s \in \Sigma^*$ matches the generalized string
$g \in \mathcal{G}_\Sigma^*$, written $s \lhd g$, if $|s| = |g|$ and $s[i] \in g[i]$ for $1 \le i \le |g|$.
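
The matching relation of Definition 1 can be illustrated with a minimal sketch. The representation of a generalized string as a Python list of character sets and the function name `matches` are assumptions made for this illustration only; they are not notation from the article.

```python
def matches(s, g):
    """Return True if the string s matches the generalized string g.

    g is modelled as a list of non-empty character sets (an assumed
    representation); position i of s must lie in the i-th set of g.
    """
    return len(s) == len(g) and all(c in chars for c, chars in zip(s, g))


# The generalized string {A}{A,B}{B}{A,C} over the alphabet {A, B, C}.
g = [{"A"}, {"A", "B"}, {"B"}, {"A", "C"}]
print(matches("ABBC", g))  # True
print(matches("ABBB", g))  # False: the last character is not in {A, C}
print(matches("ABB", g))   # False: the lengths differ
```

An ordinary string is the special case in which every set contains exactly one character.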
We write $\mathcal{G}$ instead of $\mathcal{G}_\Sigma$ if the alphabet in use is clear from the context. Note that
every string $s \in \Sigma^*$ can be translated into the generalized string $\{s[1]\}\{s[2]\} \ldots \{s[|s|]\}$.
In this sense, strings can be seen as special cases of generalized strings. Let us now
proceed to the classical definitions of automata.

Definition 2 (Deterministic finite automaton (DFA)). A deterministic finite automaton
is a tuple $(Q, \Sigma, \delta, q_a, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet,
$\delta : Q \times \Sigma \to Q$ is a transition function, $q_a \in Q$ is the start state, and $F \subseteq Q$ is the set of
accepting states.

Definition 3 (Non-deterministic finite automaton (NFA)). A non-deterministic finite
automaton is a tuple $(Q, \Sigma, \Delta, Q_a, F)$, where $Q$, $\Sigma$ and $F$ are defined as for the DFA
above, $\Delta : Q \times \Sigma \to 2^Q$ is the non-deterministic transition function and $Q_a \subseteq Q$ is a
set of start states.

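Note that using a set instead of only one start state is a notational convenience
rather than a conceptual change: we can always transform the automaton to have only
one start state by adding a new start state $q_a$ (accepting if $Q_a \cap F \neq \emptyset$) and defining
its outgoing transitions by $\Delta(q_a, \sigma) := \bigcup_{q \in Q_a} \Delta(q, \sigma)$ for all $\sigma \in \Sigma$.

Another convenience is the extension of a DFA's transition function to strings (instead
of single characters):

$$\hat\delta : Q \times \Sigma^* \to Q, \qquad \hat\delta(q, s) := \begin{cases} q & \text{if } s = \varepsilon, \\ \hat\delta(\delta(q, s[1]), s[2..]) & \text{otherwise.} \end{cases}$$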

Analogously, the transition function $\Delta$ of an NFA can be extended to $\hat\Delta : Q \times \Sigma^* \to 2^Q$. Furthermore,
we define $\mathcal{L}(q) := \{s \in \Sigma^* \mid \hat\Delta(q, s) \cap F \neq \emptyset\}$ and call it the language of state $q$. The language
of a set of states $Q'$ is defined as $\mathcal{L}(Q') := \bigcup_{q' \in Q'} \mathcal{L}(q')$. Following [1], we call a state
$q \in Q$ accessible if there exist a string $s \in \Sigma^*$ and a start state $q_a \in Q_a$ such that
$q \in \hat\Delta(q_a, s)$. A state $q \in Q$ is called coaccessible if there exist a string $s \in \Sigma^*$ and an
accepting state $q_f \in F$ such that $q_f \in \hat\Delta(q, s)$. Equivalently, $q \in Q$ is coaccessible if
$\mathcal{L}(q) \neq \emptyset$. If all states of an automaton are accessible and coaccessible, it is called
trim.

Let us briefly review the classical textbook construction of a DFA recognizing the 
same language as a given NFA. 

Lemma 1 (Subset Construction; Rabin and Scott, [13]). Let $M = (Q, \Sigma, \Delta, Q_a, F)$ be
an NFA. Then $(2^Q, \Sigma, \delta, Q_a, \{Q' \in 2^Q \mid Q' \cap F \neq \emptyset\})$, with $\delta : (Q', \sigma) \mapsto \bigcup_{q' \in Q'} \Delta(q', \sigma)$,
is a DFA that recognizes the same language as $M$.

Proof. Omitted. See [13]. □

As mentioned above, some DFA states may be inaccessible. These states can be
removed from the DFA's state space. To ease notation, we write SubsetConstruction($M$)
to denote the DFA resulting from the subset construction and subsequent removal of
inaccessible states.
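
The subset construction restricted to accessible states can be sketched as follows. This is an illustrative sketch, not code from the article: the dictionary encoding of the transition function, the function name `subset_construction`, and the toy NFA are all assumptions.

```python
from collections import deque


def subset_construction(alphabet, delta, starts, finals):
    """Determinize an NFA, generating only the accessible DFA states.

    delta maps (state, char) to a set of successor states; a missing key
    means "no transition". DFA states are frozensets of NFA states.
    """
    finals = set(finals)
    start = frozenset(starts)
    states = {start}
    dfa_delta = {}
    queue = deque([start])
    while queue:  # breadth-first search over the accessible subsets
        current = queue.popleft()
        for c in alphabet:
            target = frozenset(t for q in current for t in delta.get((q, c), ()))
            dfa_delta[(current, c)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
    dfa_finals = {S for S in states if S & finals}
    return states, dfa_delta, start, dfa_finals


# Toy NFA over {a, b} accepting all strings that end in "ab".
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
states, dfa_delta, start, dfa_finals = subset_construction("ab", nfa_delta, {0}, {2})
print(len(states))  # 3 accessible subsets instead of 2**3 = 8
```

Note that the empty set, if reachable, is kept as an ordinary (non-accepting) DFA state, so the resulting transition function is total.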
3 Simple NFAs



Recall that our goal is to identify a class of NFAs for which the subset construction 
yields a minimal DFA; where a DFA is called minimal if there does not exist a DFA with 
fewer states that recognizes the same language. To this end, we define simple NFAs. 

Definition 4 (Simple non-deterministic finite automaton). Let an NFA $M = (Q, \Sigma, \Delta, Q_a, F)$
be given. $M$ is called simple if all states are accessible and the languages $\mathcal{L}(q)$ of all
states $q \in Q$ are non-empty and pairwise disjoint.

In other words, an automaton is simple if and only if it is trim and the languages of all
states are pairwise disjoint. Note that an automaton can easily be made trim: if there
is a state $q$ that is not coaccessible, that is, $\mathcal{L}(q)$ is empty, we can safely remove $q$ from
$Q$ without changing the recognized language. Likewise, all inaccessible states can be
removed without changing the recognized language.

Theorem 1 (Minimality of DFA constructed from simple NFA). Let $M_n = (Q, \Sigma, \Delta, Q_a, F)$
be a simple NFA. Then, the DFA

$$M_d = (\mathcal{Q} \subseteq 2^Q, \Sigma, \delta, Q_a, \mathcal{F}) = \text{SubsetConstruction}(M_n)$$

is minimal.

Before we are able to prove this, we need an auxiliary lemma and the notion of
equivalent states in a DFA. We define two states $p$ and $q$ of a DFA $(Q', \Sigma', \delta', q'_a, F')$ to
be equivalent if $\hat\delta'(p, s) \in F' \iff \hat\delta'(q, s) \in F'$ for all $s \in (\Sigma')^*$.

Lemma 2. A DFA is minimal if and only if its states are pairwise non-equivalent.

Proof. See Chapters 13 and 15 in [6]. □

Proof of Theorem 1. Let $Q', Q'' \in \mathcal{Q}$ be two distinct DFA states. By Lemma 2, we have
to show that $Q'$ and $Q''$ are not equivalent, or more formally

$$\mathcal{L}(Q') = \bigcup_{q' \in Q'} \mathcal{L}(q') \neq \bigcup_{q'' \in Q''} \mathcal{L}(q'') = \mathcal{L}(Q'') . \qquad (1)$$

Without loss of generality, assume that $Q' \setminus Q'' \neq \emptyset$ and let $q \in Q' \setminus Q''$. By Definition 4,
$\mathcal{L}(q) \cap \mathcal{L}(q'') = \emptyset$ for all $q'' \in Q''$ and thus $\mathcal{L}(q) \cap \mathcal{L}(Q'') = \emptyset$. But, by choice of $q$,
$\mathcal{L}(q) \subseteq \mathcal{L}(Q')$ and, by Definition 4, $\mathcal{L}(q) \neq \emptyset$. Hence, it follows that $\mathcal{L}(Q') \neq \mathcal{L}(Q'')$. □

3.1 An Alternative Proof 

We give an alternative proof of Theorem 1 by means of the theory developed in [21].
There, van Glabbeek and Ploeger consider five different variants of the classical subset
construction. Each variant is characterized by an operation $f : 2^Q \to 2^Q$, where $Q$ is
the state space of an NFA. When a new DFA state is produced in the course of the
subset construction, it is subjected to the operation $f$ before being added to the final
automaton. In one variant, they define $f$ to be the closure operation

$$\text{close}_{\mathcal{L}} : Q' \mapsto \{q \in Q \mid \mathcal{L}(q) \subseteq \mathcal{L}(Q')\}$$

and show that the subset construction endowed with this operation directly produces
minimal DFAs. Theorem 1 now follows from the definition of simple NFAs: as all sets
$\mathcal{L}(q)$ for $q \in Q$ are non-empty and pairwise disjoint, $\text{close}_{\mathcal{L}}(Q') = Q'$ for each $Q' \subseteq Q$ and, thus, the
classical subset construction yields a minimal DFA.

Note that the language inclusion problem that must be solved for the $\text{close}_{\mathcal{L}}$ operation
is hard in general. According to [21], it is PSPACE-complete.



3.2 Self-Transitions of Start States

In most practical settings like pattern search or pattern statistics, we are given a certain 
type of pattern and need to construct an automaton that accepts all strings with a 
suffix matching this pattern, rather than an automaton that accepts only the strings 
that match the pattern. For instance, if our pattern is the single string ABC and we 
want to find all occurrences of ABC in a long text, we need to build an automaton 
recognizing all strings whose last three letters are ABC. For NFAs, we can easily obtain 
such an automaton once we have constructed an NFA accepting all strings that match 
our pattern. All we need to do is to modify the transition function $\Delta$ by adding
self-transitions to all start states:

$$\Delta_\circlearrowleft : (q, \sigma) \mapsto \begin{cases} \Delta(q, \sigma) \cup \{q\} & \text{if } q \in Q_a, \\ \Delta(q, \sigma) & \text{otherwise.} \end{cases} \qquad (2)$$

Throughout this article, the subscript "$\circlearrowleft$" refers to this modification of a transition
function. The next lemma characterizes those simple NFAs that remain simple under
this modification.

Lemma 3. Let $M = (Q, \Sigma, \Delta, Q_a, F)$ be a simple NFA. The modified automaton $M_\circlearrowleft :=
(Q, \Sigma, \Delta_\circlearrowleft, Q_a, F)$ is simple if and only if, in $M$, no start state can be reached from any
other state. That means there do not exist $\sigma \in \Sigma$, $q_a \in Q_a$, and $q \in Q$ with $q_a \neq q$ such
that $q_a \in \Delta(q, \sigma)$.

Proof. In this proof, we use the notation $\mathcal{L}_\circlearrowleft(q)$ to refer to the language of the state $q$
with respect to the modified NFA $(Q, \Sigma, \Delta_\circlearrowleft, Q_a, F)$.

"$\Rightarrow$": Suppose $(Q, \Sigma, \Delta_\circlearrowleft, Q_a, F)$ is simple and there exist $\sigma \in \Sigma$, $q_a \in Q_a$, and
$q \in Q$ with $q_a \neq q$ such that $q_a \in \Delta(q, \sigma)$. Thus, $\sigma s \in \mathcal{L}(q)$ for all $s \in \mathcal{L}(q_a)$. Because
of the added self-transition, we also have $\sigma s \in \mathcal{L}_\circlearrowleft(q_a)$ and, thus, $\mathcal{L}_\circlearrowleft(q_a)$ and $\mathcal{L}_\circlearrowleft(q)$ are
not disjoint, contradicting the assumption that $M_\circlearrowleft$ is simple.

"$\Leftarrow$": Now, we assume that there do not exist any $\sigma \in \Sigma$, $q_a \in Q_a$, and $q \in Q$
with $q_a \neq q$ such that $q_a \in \Delta(q, \sigma)$. The properties that all states are accessible and
coaccessible cannot get lost by adding the additional self-transitions. Therefore, we only
need to verify that $\mathcal{L}_\circlearrowleft(q)$ and $\mathcal{L}_\circlearrowleft(q')$ are disjoint for all distinct $q, q' \in Q$. For the
sake of contradiction, we assume there exist distinct $q, q' \in Q$ violating this condition.
We choose $s \in \mathcal{L}_\circlearrowleft(q) \cap \mathcal{L}_\circlearrowleft(q')$ such that $s \in \mathcal{L}_\circlearrowleft(q) \setminus \mathcal{L}(q)$; if that is not possible, it
becomes possible after swapping $q$ and $q'$, because $\mathcal{L}(p) \subseteq \mathcal{L}_\circlearrowleft(p)$ for all $p \in Q$ and
$\mathcal{L}(q) \cap \mathcal{L}(q') = \emptyset$. We have to distinguish two cases:

Case 1 ($s \in \mathcal{L}(q')$): By our assumption, there does not exist a state in $Q \setminus Q_a$ from
which a start state can be reached. This means that the transition function remains
unchanged for all states reachable from any state in $Q \setminus Q_a$, which implies that $\mathcal{L}(p) =
\mathcal{L}_\circlearrowleft(p)$ for all $p \in Q \setminus Q_a$. Therefore, $q$ must be a start state. We chose $s$ to lie in
$\mathcal{L}_\circlearrowleft(q) \setminus \mathcal{L}(q)$, which implies that there exists a $k \in \mathbb{N}$ such that $s[k..] \in \mathcal{L}(q)$. Since all
$\mathcal{L}(p)$ for $p \in Q$ are disjoint, it follows that $s[k..] \notin \mathcal{L}(p)$ for all $p \in Q \setminus \{q\}$. As $s \in \mathcal{L}(q')$,
we thus conclude that $q \in \hat\Delta(q', s[..k-1])$, which contradicts the assumption that we
cannot reach a start state from any other state than itself.

Case 2 ($s \notin \mathcal{L}(q')$): By the same argument as in the last case, we conclude that $q$ and $q'$
must be start states. Again, this implies the existence of $k, k' \in \mathbb{N}$ such that $s[k..] \in \mathcal{L}(q)$
and $s[k'..] \in \mathcal{L}(q')$. If $k = k'$, then $s[k..] \in \mathcal{L}(q) \cap \mathcal{L}(q') \neq \emptyset$, contradicting the simpleness
of $M$. We assume, without loss of generality, that $k < k'$. Since $s[k'..] \in \mathcal{L}(q')$ and
$s[k'..] \notin \mathcal{L}(p)$ for all $p \in Q \setminus \{q'\}$, we conclude that $q' \in \hat\Delta(q, s[k, k'-1])$, again
contradicting the assumption that we cannot reach a start state from any other state
than itself. □



4 Application to Generalized Strings

In the next two sections, we show that generalized strings and sets of generalized strings
admit the construction of simple NFAs. Obviously, a single string is a special case of a
set of strings. To aid understanding, we nonetheless start with the easier case of a
single string.

4.1 Single Generalized Strings

For a generalized string $g$, an NFA recognizing all strings that match $g$ can easily be
constructed by connecting the state set $Q = \{0, \ldots, |g|\}$ with the transition function

$$\Delta : (i, \sigma) \mapsto \begin{cases} \{i+1\} & \text{if } i < |g| \text{ and } \sigma \in g[i+1], \\ \emptyset & \text{otherwise.} \end{cases}$$

Setting $Q_a = \{0\}$ and $F = \{|g|\}$ completes the construction of our NFA $(Q, \Sigma, \Delta, Q_a, F)$.
For brevity, we write NFA($g$) to denote the automaton created from a generalized
string $g$ using the above construction.

Lemma 4. Let $g$ be a generalized string. Then $M_g :=$ NFA($g$) is a simple NFA.

Proof. Clearly, all states $i \in Q$ are accessible and coaccessible. $M_g$ admits only transitions
from a state $i$ to its successor state $i + 1$; only the last state in this chain is an
accepting state. Thus, for each state $i \in Q$, the lengths of all accepted strings $s \in \mathcal{L}(i)$
equal $|g| - i$. Hence, for two different states $i$ and $j$, accepted strings have different
lengths. Thus, all $\mathcal{L}(i)$ must be pairwise disjoint (for $i \in Q$). □

Figure 1: Example of a simple NFA (with self-transition added to the start state)
constructed from the generalized string {A}{A,B}{B}{A,C} over the alphabet
$\Sigma$ = {A,B,C}. The accepting state is represented by two concentric circles.

As discussed in Section 3.2, we often need to add a self-transition to the start state.
This modification is defined formally in Equation (2). We write NFA$_\circlearrowleft(g)$ to refer to the
resulting automaton. See Figure 1 for an example. Combining Theorem 1, Lemma 4,
and Lemma 3, we arrive at the following corollary:

Corollary 1. Let $g$ be a generalized string and $M_g :=$ NFA$_\circlearrowleft(g)$ the corresponding NFA.
Then, SubsetConstruction($M_g$) is a minimal DFA.
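
As an illustration of Corollary 1, the following sketch builds the NFA for a single generalized string with a self-transition on the start state, determinizes it, and uses the resulting DFA to report the end positions of all occurrences in a text. The function names, the dictionary encoding, and the example text are assumptions chosen for this sketch.

```python
def nfa_with_self_loop(g, alphabet):
    """NFA for the generalized string g (a list of character sets).

    States are 0..len(g); state 0 is the start state and loops on every
    character, state len(g) is the only accepting state.
    """
    delta = {}
    for i, chars in enumerate(g):
        for c in chars:
            delta.setdefault((i, c), set()).add(i + 1)  # chain transition i -> i+1
    for c in alphabet:
        delta.setdefault((0, c), set()).add(0)          # self-transition at the start
    return delta, {0}, {len(g)}


def determinize(alphabet, delta, starts, finals):
    """Subset construction restricted to accessible states."""
    start = frozenset(starts)
    states, todo, dfa_delta = {start}, [start], {}
    while todo:
        cur = todo.pop()
        for c in alphabet:
            nxt = frozenset(t for q in cur for t in delta.get((q, c), ()))
            dfa_delta[(cur, c)] = nxt
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    return states, dfa_delta, start, {S for S in states if S & finals}


def occurrences(text, dfa):
    """Return the 1-based end positions of all matches in text."""
    _, dfa_delta, state, dfa_finals = dfa
    ends = []
    for pos, c in enumerate(text, start=1):
        state = dfa_delta[(state, c)]
        if state in dfa_finals:
            ends.append(pos)
    return ends


alphabet = "ABC"
dfa = determinize(alphabet, *nfa_with_self_loop([{"A"}, {"B"}, {"C"}], alphabet))
print(len(dfa[0]))                  # 4
print(occurrences("AABCABC", dfa))  # [4, 7]
```

For the plain string ABC, the four DFA states are {0}, {0,1}, {0,2} and {0,3}, which is the familiar pattern-matching automaton with one state per matched prefix length.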

4.2 Sets of Generalized Strings 

In this section, we generalize the above results to finite sets of generalized strings of
equal length. Speaking formally, we assume a length $\ell$ and $G \subseteq \mathcal{G}^\ell$ to be given and
seek to construct a simple NFA that recognizes all strings that have a suffix matching a
$g \in G$. As above, we first construct an automaton that recognizes all strings matching
a $g \in G$ and, in a second step, add self-transitions to the start states $Q_a$.

The automaton we build is organized level-wise with $\ell + 1$ levels. Transitions are only
possible between states in adjacent levels and only in one direction (which we choose to
call downwards). The bottom level contains just one state, which is the single accepting
state; all states in the top level are start states. As before for a single generalized string,
two states $q'$ and $q''$ in different levels are obviously "language-disjoint", meaning that
$\mathcal{L}(q') \cap \mathcal{L}(q'') = \emptyset$. But here, we possibly need more than one state in a level, which
entails the problem of ensuring language-disjointness for states in the same level. We
achieve this by using a state space induced by a special parent-child relation between
states in adjacent levels. Before we formally construct state space and automaton, the
impatient reader may have a look at the example in Figure 2.

Let us begin with the formal specification of a suitable state space $Q$. We choose $Q$
to be a special subset of $\hat{Q} := 2^G \times \{0, \ldots, \ell\}$ with the following semantics in mind: to
be in state $q = (H, k)$ means that the last $k$ characters read match the first $k$ positions
of a $g \in H$. For the definition of $Q$, we need the function Parent : $\hat{Q} \times \Sigma \to \hat{Q} \cup \{\bot\}$
given by

$$\text{Parent} : ((H, k), \sigma) \mapsto \begin{cases} (\{h \in H \mid \sigma \in h[k]\}, k-1) & \text{if } k > 0, \\ \bot & \text{otherwise.} \end{cases} \qquad (3)$$

Figure 2: Example of a simple NFA constructed from the three generalized strings
0:{B,C}{A,C}{A,B}, 1:{A}{B}{A,B,C}, and 2:{C}{B,C}{A,C} over the alphabet
$\Sigma$ = {A,B,C}. Each state is annotated with the set of generalized strings
that are "active" in this state (each generalized string is represented by its
index 0, 1, or 2). The accepting state is represented by two concentric circles.



We say that Parent($q$, $\sigma$) is a parent of $q$ under the character $\sigma$. The special symbol $\bot$
is used to indicate that a state is in the top level and therefore does not have any parents.
The Parent mapping induces a hierarchy of $\ell + 1$ levels of states:

$$Q_\ell := \{(G, \ell)\}, \qquad (4)$$
$$Q_i := \{(H, i) \mid H \in 2^G \setminus \{\emptyset\}, \ \exists q \in Q_{i+1}, \sigma \in \Sigma : \text{Parent}(q, \sigma) = (H, i)\}, \qquad (5)$$

for $0 \le i < \ell$. Finally, we write our state space as

$$Q := Q_0 \cup \ldots \cup Q_\ell . \qquad (6)$$

The Parent mapping also induces a transition function $\Delta$:

$$\Delta : ((H, k), \sigma) \mapsto \begin{cases} \{q \in Q_{k+1} \mid \text{Parent}(q, \sigma) = (H, k)\} & \text{if } k < \ell, \\ \emptyset & \text{otherwise.} \end{cases} \qquad (7)$$

To complete the construction, we set $Q_a := Q_0$ and $F := Q_\ell = \{(G, \ell)\}$ and obtain
NFA($G$) $:= (Q, \Sigma, \Delta, Q_a, F)$. The next lemma states that an NFA constructed in this
way accepts exactly the language given by $G$.

Lemma 5. Let a length $\ell \in \mathbb{N}$, a set of generalized strings $G \subseteq \mathcal{G}^\ell$, and $(Q, \Sigma, \Delta, Q_a, F) =$
NFA($G$) be given. Then,

$$\exists q \in Q_a : \hat\Delta(q, s) \cap F \neq \emptyset \iff \exists g \in G : s \lhd g ,$$

for all $s \in \Sigma^*$.

Proof. We start with the forward direction "$\Rightarrow$". If $s \in \Sigma^*$ is accepted by NFA($G$),
then there exists a sequence of states $q_0, \ldots, q_{|s|}$ such that $q_0 \in Q_a$, $q_{|s|} \in F$, and $q_i \in
\Delta(q_{i-1}, s[i])$ for $0 < i \le |s|$. It follows from Equation (7) that $q_{i-1} =$ Parent($q_i$, $s[i]$).
Hence, Equation (3) implies that $H_0 \subseteq \ldots \subseteq H_{|s|}$, where $(H_i, k_i) := q_i$. Furthermore, by
Equation (5), $H_0$ is non-empty. Inductively applying (3) now yields that $s \lhd h$ for all
$h \in H_0$, which proves the forward direction.

Let us prove the backward direction "$\Leftarrow$". Let $g \in G$ such that $s \lhd g$. Consider
the sequence of states $q'_0, \ldots, q'_{|s|}$ with $(H'_i, k'_i) := q'_i$ given by $q'_{|s|} := (G, \ell)$ and $q'_{i-1} :=$
Parent($q'_i$, $s[i]$) for $0 < i \le |s|$. From $s \lhd g$ and Equation (3) it follows that $g \in H'_i$ for
$0 \le i \le |s|$. Thus, each $H'_i$ is non-empty and by Equations (4) and (5) we get $q'_i \in Q_i$
for $0 \le i \le |s|$, implying that $q'_0 \in Q_0 = Q_a$ is a start state. From Equation (7) we
conclude that $q'_{|s|} \in \hat\Delta(q'_0, s)$, which proves the claim as $q'_{|s|} \in Q_\ell = F$. □

In analogy to Lemma 4, we verify that NFA($G$) is indeed a simple NFA.

Lemma 6. Let $\ell \in \mathbb{N}$ and $G \subseteq \mathcal{G}^\ell$. Then, $M_G :=$ NFA($G$) is a simple NFA.

Proof. The level-wise construction directly implies that all states are accessible and
coaccessible, i.e. $\mathcal{L}(q)$ is non-empty for all $q \in Q$. States $(H, i)$ with empty $H$ cannot be
generated by Equation (5).

It remains to be shown that for all distinct $p, q \in Q$ the sets $\mathcal{L}(p)$ and $\mathcal{L}(q)$ are disjoint.
By construction, this is clearly true if $p$ and $q$ are in different levels. Hence, it suffices
to show that

$$\mathcal{L}(p) \cap \mathcal{L}(q) = \emptyset \quad \text{for all } p, q \in Q_i \text{ with } p \neq q \qquad (8)$$

for all $Q_i$ with $0 \le i \le \ell$. We prove this by induction on $i$. First, note that for
$i = \ell$, Condition (8) is fulfilled as $|Q_\ell| = 1$. Assume that (8) holds for $i > 0$. For
the sake of contradiction, we further assume there exist distinct $p, q \in Q_{i-1}$ such that
$\mathcal{L}(p) \cap \mathcal{L}(q) \neq \emptyset$. Let $s \in \mathcal{L}(p) \cap \mathcal{L}(q)$; it follows that $\hat\Delta(p, s) \cap F \neq \emptyset$. There must exist a
state $r \in Q_i$ such that $\hat\Delta(r, s[2..]) \cap F \neq \emptyset$. As, by our induction hypothesis, Condition (8)
holds for $i$, we conclude that the state $r$ is unique. It follows from (7) that $r \in \Delta(p, s[1])$
and $r \in \Delta(q, s[1])$. Applying the definition of $\Delta$, we get $p =$ Parent($r$, $s[1]$) $= q$ and,
thus, $p = q$, a contradiction. □
In Section 4.1, we added an initial self-transition to the constructed NFA in order to
accept not only the given generalized string, but all strings whose suffix matches the
generalized string. We thereby obtained an automaton that finds all occurrences of the
generalized string in a given text. Now we repeat this step by transforming NFA($G$)
using Equation (2). Again, we refer to the resulting modified automaton by NFA$_\circlearrowleft(G)$.
Note that for $|G| = 1$ we obtain the same automaton as constructed in Section 4.1.
Combining Theorem 1, Lemma 6, and Lemma 3 yields the following corollary:

Corollary 2. Let $\ell \in \mathbb{N}$, $G \subseteq \mathcal{G}^\ell$, and $M_G :=$ NFA$_\circlearrowleft(G)$.
Then, the result of SubsetConstruction($M_G$) is a minimal DFA.

4.2.1 Algorithm and Runtime 

The construction scheme formalized above can directly be translated
into an algorithm:

1. Initialize the transition map $\Delta$ to be empty.

2. Initialize the bottom level $Q_\ell$ to contain its only state $(G, \ell)$.

3. For $k$ from $\ell - 1$ down to 0, build level $Q_k$:

a) Initialize level $Q_k$ to be empty.

b) For each node $(H', k + 1) \in Q_{k+1}$ and each $\sigma \in \Sigma$:

i. Compute the set $H := \{h \in H' \mid \sigma \in h[k + 1]\}$.

ii. If $H \neq \emptyset$ and $(H, k) \notin Q_k$, add $(H, k)$ to $Q_k$.

iii. If $H \neq \emptyset$, add the transition $((H, k), \sigma) \mapsto (H', k + 1)$ to $\Delta$.

4. Add self-transitions to all $q \in Q_0$.

In Loop 3, we build $\ell$ levels. Each level contains at most $2^{|G|}$ states and thus the body
of Loop 3b is executed $O(2^{|G|} \cdot |\Sigma|)$ times for each level, where Step 3(b)i takes $O(|G|)$
time and the other steps can be performed in constant time. All in all, the algorithm
takes $O(2^{|G|} \cdot \ell \cdot |\Sigma| \cdot |G|)$ time.

The construction of a minimal DFA from a set of generalized strings thus takes
$O(2^{|G|} \cdot \ell \cdot |\Sigma| \cdot |G| + m)$ time, where $m$ is the number of states in the minimal DFA.
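
The four steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: generalized strings are lists of character sets, members of $G$ are referred to by their index, states are pairs (frozenset of indices, level), and the names `build_nfa` and `accepts` are hypothetical. The example set is the one from Figure 2.

```python
from itertools import product


def build_nfa(G, alphabet, self_loops=True):
    """Level-wise NFA for a list G of equal-length generalized strings."""
    ell = len(G[0])
    bottom = (frozenset(range(len(G))), ell)       # step 2: the state (G, ell)
    levels = {ell: {bottom}}
    delta = {}                                     # step 1: empty transition map
    for k in range(ell - 1, -1, -1):               # step 3: build level k
        levels[k] = set()
        for child in levels[k + 1]:
            for c in alphabet:
                # 0-based G[h][k] corresponds to h[k+1] in the text
                H = frozenset(h for h in child[0] if c in G[h][k])
                if H:
                    levels[k].add((H, k))
                    delta.setdefault(((H, k), c), set()).add(child)
    if self_loops:                                 # step 4
        for q in levels[0]:
            for c in alphabet:
                delta.setdefault((q, c), set()).add(q)
    return levels, delta, set(levels[0]), {bottom}


def accepts(s, delta, starts, finals):
    """Simulate the NFA on s by tracking the set of active states."""
    current = set(starts)
    for c in s:
        current = {t for q in current for t in delta.get((q, c), ())}
    return bool(current & finals)


G = [[set("BC"), set("AC"), set("AB")],
     [set("A"), set("B"), set("ABC")],
     [set("C"), set("BC"), set("AC")]]
levels, delta, starts, finals = build_nfa(G, "ABC", self_loops=False)

# Without self-transitions, the NFA should accept exactly the matching strings.
for s in map("".join, product("ABC", repeat=3)):
    expected = any(all(c in g[i] for i, c in enumerate(s)) for g in G)
    assert accepts(s, delta, starts, finals) == expected
print([len(levels[k]) for k in range(4)])  # number of states per level
```

Storing each state as a hashable pair lets the membership test of Step 3(b)ii be handled by a set, which is one way to obtain the (expected) constant-time bound assumed in the runtime analysis.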



5 Application to Consensus Strings with a Hamming Neighborhood

Another type of motif commonly used in computational biology is a consensus string
along with a distance threshold. Here, we assume that a (generalized) string $g$ and a
distance threshold $d_{\max}$ are given and want to compute the minimal DFA that recognizes
all strings with a Hamming distance to $g$ of at most $d_{\max}$, where the Hamming distance
between a string $s$ and a generalized string $g$ of the same length is defined as

$$d(s, g) := \left| \{ i \in \{1, \ldots, |s|\} \mid s[i] \notin g[i] \} \right| .$$

Figure 3: Example of a simple NFA over the alphabet $\Sigma$ = {A,B,C,D} recognizing the
consensus ADC and all strings within a Hamming distance of two or less. Characters
with bars stand for the inverse, e.g. $\bar{\text{A}}$ stands for B, C, or D. The accepting
state is represented by two concentric circles.



In this section, we construct a simple NFA recognizing a generalized string and its
Hamming neighborhood. The construction is similar to the one given in [TT]. Interestingly,
the resulting NFA turns out to be simple.

The basic idea for the construction is to use a two-dimensional grid as a state space,
where we advance into one dimension whenever a valid character has been read and into
the other dimension for each mismatch. Figure 3 illustrates an NFA built in this way.
Formally, the state space is defined by

$$Q := \{ (e, k) \in \{0, \ldots, d_{\max}\} \times \{0, \ldots, |g|\} \mid |g| - k - e \ge 0 \} \qquad (9)$$
with the following semantics: state $(e, k)$ accepts all strings of length $|g| - k$ that match
the respective suffix of $g$ with exactly $e$ errors. The condition $|g| - k - e \ge 0$ states that
the number of errors $e$ cannot be larger than $|g| - k$, which is the number of characters
left. We define the transition function to obey this semantics:

$$\Delta : ((e, k), \sigma) \mapsto \begin{cases} z(e, k+1) & \text{if } \sigma \in g[k+1], \\ z(e-1, k+1) & \text{otherwise,} \end{cases} \qquad (10)$$

where the function $z : \mathbb{Z} \times \mathbb{Z} \to 2^Q$ returns the empty set whenever we "fall off the grid".
More precisely,

$$z(e, k) := \begin{cases} \{(e, k)\} & \text{if } (e, k) \in Q, \\ \emptyset & \text{otherwise.} \end{cases} \qquad (11)$$

As before, the topmost level constitutes the start states, i.e. $Q_a := \{(e, k) \in Q \mid k = 0\}$,
and the bottommost level contains only the single accepting state, i.e. $F := \{(0, |g|)\}$.
We write NFA($g$, $d_{\max}$) $:= (Q, \Sigma, \Delta, Q_a, F)$ to denote the NFA constructed in this way.
Again, we use the notation NFA$_\circlearrowleft(g, d_{\max}) := (Q, \Sigma, \Delta_\circlearrowleft, Q_a, F)$ to refer to the automaton
with self-transitions added to the start states. Note that for $d_{\max} = 0$, the resulting
automaton is isomorphic to the one constructed from a single generalized string in
Section 4.1.
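
The grid construction can be sketched in a few lines. The sketch below makes several assumptions for illustration: the generalized string is a list of character sets, the transition function is a Python closure rather than a table, and the names `hamming_nfa` and `accepts` are hypothetical. It builds the automaton without self-transitions and compares its language with a brute-force Hamming distance check for the consensus ADC of Figure 3.

```python
from itertools import product


def hamming_nfa(g, d_max):
    """Grid NFA for the generalized string g with at most d_max mismatches."""
    n = len(g)
    # State (e, k): k characters read, exactly e mismatches still to come.
    Q = {(e, k) for e in range(d_max + 1) for k in range(n + 1) if n - k - e >= 0}

    def z(e, k):
        # Return the empty set whenever we "fall off the grid".
        return {(e, k)} if (e, k) in Q else set()

    def delta(state, c):
        e, k = state
        if k >= n:                       # no characters left to read
            return set()
        # 0-based g[k] corresponds to g[k+1] in the text
        return z(e, k + 1) if c in g[k] else z(e - 1, k + 1)

    starts = {(e, k) for (e, k) in Q if k == 0}
    finals = {(0, n)}
    return Q, delta, starts, finals


def accepts(s, delta, starts, finals):
    """Simulate the NFA on s by tracking the set of active states."""
    current = set(starts)
    for c in s:
        current = {t for q in current for t in delta(q, c)}
    return bool(current & finals)


g = [{"A"}, {"D"}, {"C"}]
Q, delta, starts, finals = hamming_nfa(g, 2)
print(len(Q))  # 9 grid states

# The accepted language should be the Hamming neighborhood of radius 2.
for s in map("".join, product("ABCD", repeat=3)):
    distance = sum(c not in chars for c, chars in zip(s, g))
    assert accepts(s, delta, starts, finals) == (distance <= 2)
```

A string with exactly $d$ mismatches is accepted via the start state $(d, 0)$ only, which reflects the disjointness of the state languages.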

In order to prove that the construction is correct and produces simple NFAs, we use
the following lemma on the languages of the states.

Lemma 7. Let $g \in \mathcal{G}^*$, $d_{\max} \in \mathbb{N}_0$ and $M =$ NFA($g$, $d_{\max}$) $= (Q, \Sigma, \Delta, Q_a, F)$. Then,
the language of state $(e, k)$ is characterized by

$$\mathcal{L}((e, k)) = \{ s \in \Sigma^{|g| - k} \mid d(s, g[k+1..]) = e \} ,$$

for all $(e, k) \in Q$.

Proof. We start with the direction "$\subseteq$". By construction of $\Delta$ and $F$, we have $\mathcal{L}((e, k)) \subseteq
\Sigma^{|g| - k}$. Let $s \in \mathcal{L}((e, k))$; then $(0, |g|) \in \hat\Delta((e, k), s)$. That means, in the course of
$|s|$ state transitions, the first component of the state changes from $e$ to 0. As we see
from Equation (10), the only change possible in the first component is a decrease by 1,
which happens if and only if the read character is a mismatch. Thus, it follows that
$d(s, g[k+1..]) = e$.

Now we prove the backward direction "$\supseteq$". Let $s \in \Sigma^{|g| - k}$ such that $d(s, g[k+1..]) = e$.
That means there are exactly $e$ indices $a_1, \ldots, a_e$ such that $s[a_i] \notin g[k + a_i]$ for $1 \le i \le e$.
Provided that all states exist and thus the $z$ function never returns $\emptyset$, we apply the first
case of (10) exactly $|s| - e$ times and the second case exactly $e$ times, ending in state
$(0, |g|)$ as claimed. The only thing left to verify is that $z$ indeed never returns $\emptyset$. Note
that, by (10), the term $|g| - k - e$ cannot increase. Since it reaches zero after $|s|$
steps, it cannot have been smaller than zero at any time. Hence, by Equation (9), all
intermediate states exist and, thus, the first case of Equation (11) is applied for all state
transitions. □
Using this lemma, the construction's correctness is easily verified:

Lemma 8. Let $g \in \mathcal{G}^*$, $d_{\max} \in \mathbb{N}_0$ and $M =$ NFA($g$, $d_{\max}$) $= (Q, \Sigma, \Delta, Q_a, F)$. Then,
$M$ accepts exactly the strings $\{ s \in \Sigma^{|g|} \mid d(s, g) \le d_{\max} \}$.

Proof. By definition, $M$ accepts the strings $\mathcal{L}(Q_a)$. By construction of $Q_a$ and Lemma 7,
we obtain

$$\mathcal{L}(Q_a) = \bigcup_{e=0}^{\min(d_{\max}, |g|)} \mathcal{L}((e, 0)) = \bigcup_{e=0}^{\min(d_{\max}, |g|)} \{ s \in \Sigma^{|g|} \mid d(s, g) = e \} = \{ s \in \Sigma^{|g|} \mid d(s, g) \le d_{\max} \} .$$

□

Lemma 9. Let $g \in \mathcal{G}^*$, $d_{\max} \in \mathbb{N}_0$. Then, NFA($g$, $d_{\max}$) $= (Q, \Sigma, \Delta, Q_a, F)$ is a simple
NFA.

Proof. By construction, all states are accessible and coaccessible. The disjointness of
$\mathcal{L}(q)$ and $\mathcal{L}(q')$ for distinct $q, q' \in Q$ follows immediately from Lemma 7. □

In analogy to Sections 4.1 and 4.2, we can now add self-transitions to the start states
to obtain NFA$_\circlearrowleft(g, d_{\max})$. Note that, again, the conditions of Lemma 3 are satisfied,
allowing us to apply Theorem 1.

Corollary 3. Let $g \in \mathcal{G}^*$, $d_{\max} \in \mathbb{N}_0$, and $M =$ NFA$_\circlearrowleft(g, d_{\max})$. Then, the result of
SubsetConstruction($M$) is a minimal DFA.

The state space of NFA$_\circlearrowleft(g, d_{\max})$ has a size of $O(|g| \cdot d_{\max})$. Deriving a construction
algorithm that uses $O(1)$ time per state is straightforward. We can, therefore,
construct the minimal DFA from a generalized string $g$ and the distance threshold $d_{\max}$ in
time $O(|g| \cdot d_{\max} + m)$, where $m$ is the size of the minimal DFA.

6 Conclusions 

We introduced the concept of simple NFAs. These automata have a useful property:
when subjected to the standard subset construction, they result in minimal DFAs.
Motivated by a background in bioinformatics, we turned our attention to pattern classes
found in this field. We gave an algorithm to construct a simple NFA from a set $G$ of
generalized strings of equal length $\ell$ in time $O(2^{|G|} \cdot \ell \cdot |\Sigma| \cdot |G|)$. Interestingly, this
result suggests that the difficulty in dealing with sets of generalized strings stems from
the size of the set rather than from the length of the strings. For motifs given in the
form of a single (generalized) string $g$ along with a Hamming neighborhood bounded by
a distance threshold $d_{\max}$, we presented an algorithm that constructs a simple NFA in
time $O(|g| \cdot d_{\max})$. A third important class of motifs are position weight matrices (PWMs)
with a score threshold [l£j. Such a motif could be transformed into a set of generalized 
strings, which in turn could be handled by the presented algorithm. Nonetheless, a
more direct method to construct a simple NFA from a PWM is desirable and should be
the subject of future research.

In this article, we demonstrated that, for the considered pattern classes, a minimal
DFA can be constructed directly, that is, without the intermediate step of a non-minimal
DFA. A question we did not address concerns the size of the constructed minimal
automata. In practice, we might still be faced with an exponential blow-up in the number
of states. Thus, on the practical side, this study should be complemented in future work
by experiments measuring automata sizes and runtimes for typical motifs.